The equation for a linear regression model with one covariate looks something like this:
\[ y_i = \text{intercept} + \text{slope} \times x_i + \epsilon_i \]
where \(\epsilon\) represents noise, or measurement error, and is assumed to be normally distributed: \[ \epsilon_i \sim N(0, \sigma^2) \]
We also write this model: \[ y_i = \text{intercept} + \text{slope} \times x_i + \epsilon_i \]
\[ \epsilon_i \sim N(0, \sigma^2) \]
like this: \[ y_i \sim N(\mu_i, \sigma^2) \]
\[ \mu_i = \text{intercept} + \text{slope} \times x_i \] where \(\mu_i\) is the value of \(y_i\) we would expect to see if there wasn’t any noise.
The linear model assumes that the \(y\) are:

- continuous (any number from \(-\infty\) to \(\infty\))
- normally distributed (given \(\mu\))
But our data often aren’t!
We could transform (AKA torture) these data to make them look normal-ish. But:
We can use other distributions for \(y\) instead. E.g.:
So for count data, instead of: \[ y_i \sim N(\mu_i, \sigma^2) \] \[ \mu_i = \text{intercept} + \text{slope} \times x_i \]
we can do:
\[ y_i \sim \text{Poisson}(\lambda_i) \]
\[ \lambda_i = e^{\mu_i} \]
\[ \mu_i = \text{intercept} + \text{slope} \times x_i \]
\(\lambda_i\) is the value of \(y_i\) we’d expect on average. We need \(\lambda\) to be positive, so we squish it through a function (exp()).
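As a quick sketch of that squishing step (not from the original slides): exp() maps any real-valued \(\mu\) to a strictly positive \(\lambda\).

```r
# The log link's inverse, exp(), maps any real mu to a positive lambda
mu <- c(-5, 0, 3)
lambda <- exp(mu)
all(lambda > 0)
## [1] TRUE
```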
The same trick for binary data:
\[ y_i \sim Bernoulli(p_i) \] \[ p_i = \frac{1}{1 + e^{-\mu_i}} \] \[ \mu_i = \text{intercept} + \text{slope} \times x_i \]
where \(p_i\) is the value of \(y_i\) we would expect to see on average. We need to ensure \(p\) is between 0 and 1, so we squish it through a function (the logistic function).
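A minimal sketch of this squishing step: the logistic function (available in base R as plogis()) maps any real \(\mu\) into the interval \((0, 1)\).

```r
# The logistic function maps any real mu into (0, 1)
mu <- c(-10, 0, 10)
p <- 1 / (1 + exp(-mu))
all(p > 0 & p < 1)
## [1] TRUE

# base R provides the same function as plogis()
all.equal(p, plogis(mu))
## [1] TRUE
```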
The general trick:
See ?family in R for some options:

- binomial(link = "logit")
- gaussian(link = "identity")
- Gamma(link = "inverse")
- inverse.gaussian(link = "1/mu^2")
- poisson(link = "log")
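Each family pairs a distribution with a default link function, and base R’s make.link() lets you inspect a link and its inverse. A small sketch using only base R:

```r
# poisson() defaults to the log link
poisson()$link
## [1] "log"

# make.link() returns the link function and its inverse
link <- make.link("log")
link$linkfun(2)    # same as log(2)
link$linkinv(0.5)  # same as exp(0.5)
```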
Our count, binary and other data fit the assumptions of these models!
They are more powerful (and unbiased) than transforming data and using linear models.
Our paper might be accepted!
But these models aren’t linear, so we need a different way to fit them:
We can compute the likelihood of a model: the probability of observing the data, given the parameters.
x <- c(-1.38, 4.07, 0.84); y <- c(4, 0, 1)

likelihood <- function(intercept, slope) {
  mu <- intercept + slope * x
  lambda <- exp(mu)
  prod(dpois(y, lambda))
}

likelihood(0, -0.5)
## [1] 0.02679616

likelihood(0, -1)
## [1] 0.05383974
We fit the models by trying different values of the parameters, and find the parameters that give the biggest likelihood.
objective <- function(params) {
  -1 * likelihood(params[1], params[2])
}

optim(c(intercept = 0, slope = 0), objective)$par
##  intercept      slope
##  0.4053731 -0.7307372

coef(glm(y ~ x, family = poisson))
## (Intercept)           x
##   0.4053842  -0.7307279
We can compare models using likelihoods too:
We can only compare different sets of parameters, so:

- we cannot compare models fitted to different datasets
- we cannot compare models with different distributions

but:

- we can compare models with different covariates, because that’s the same as setting their coefficients to \(0\)
- models that are the same, but with some coefficients set to 0, are called nested
We can compare our GLM with an intercept-only GLM (which assumes no effect of \(x\)).
full_model_lik <- likelihood(0.4053, -0.7307)
intercept_only_lik <- likelihood(0.5108, 0)

full_model_lik / intercept_only_lik
## [1] 18.04014

The data are 18 times more likely under the model with the covariate.
If we add more parameters (letting the slope vary) we would expect to do better at explaining the data, even if the covariate isn’t actually related to the response.
Statistical theory tells us that, if the covariates had no effect, \(-2\) times the difference in the log-likelihoods follows a chi-squared distribution, with degrees of freedom equal to the number of extra parameters.
log_diff <- (log(intercept_only_lik) - log(full_model_lik))
-2 * log_diff
## [1] 5.785199
m_full <- glm(y ~ x, family = poisson)
m_int <- glm(y ~ 1, family = poisson)
anova(m_int, m_full)
## Analysis of Deviance Table
##
## Model 1: y ~ 1
## Model 2: y ~ x
## Resid. Df Resid. Dev Df Deviance
## 1 2 5.9821
## 2 1 0.1969 1 5.7852
We can compute the chance of seeing at least this big a difference under this chi-squared distribution:
1 - pchisq(-2 * log_diff, 1)
## [1] 0.01616167
lmtest::lrtest(m_int, m_full)
## Likelihood ratio test
##
## Model 1: y ~ 1
## Model 2: y ~ x
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 1 -5.6239
## 2 2 -2.7313 1 5.7852 0.01616 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can also compute metrics to rank different models, whilst accounting for overfitting. E.g. the Akaike Information Criterion:
\[ AIC = -2 \times ln(\hat{L}) + 2 \times k \]
where \(k\) is the number of parameters. This trades off goodness of fit (\(-2 \times ln(\hat{L})\)) against model complexity (\(2 \times k\)). Smaller values of AIC (high likelihood and low complexity) imply a ‘better’ model, but AIC is only a relative measure; it doesn’t tell you how good your model is.
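We can check the formula by computing AIC by hand for the Poisson model; the data and model are re-created here so the snippet stands alone.

```r
x <- c(-1.38, 4.07, 0.84); y <- c(4, 0, 1)
m_full <- glm(y ~ x, family = poisson)

# k = number of fitted parameters (intercept and slope)
k <- length(coef(m_full))
manual_aic <- -2 * as.numeric(logLik(m_full)) + 2 * k

all.equal(manual_aic, AIC(m_full))
## [1] TRUE
```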
AIC(m_full)
## [1] 9.462653

AIC(m_int)
## [1] 13.24785
Again, the full model seems better, even though it has more parameters.
Both likelihood ratios and information criteria are approximations of how much the model is overfitting to the data, based on some assumptions.
A more robust alternative is to calculate the likelihood against some data that weren’t used to fit the model.
If you don’t have an obvious second set of data to do this external validation, you can chop your data up into smaller datasets…
[figure: K-fold cross-validation diagram, image from Wikipedia]
This takes longer than the alternative approaches (you need to re-fit the model lots of times)
It might not work well if you have very little data, since each of the datasets you are fitting with will be smaller than the whole one.
You might also need to think hard about how you split your data into folds. Maybe it’s a better test to split by species, geographic area or year than splitting randomly?
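One way to sketch such blocked folds in base R (the species column here is hypothetical, purely for illustration):

```r
# Assign rows to folds by a grouping variable instead of at random
df_blocked <- data.frame(y = c(4, 0, 1, 2, 3, 1),
                         species = rep(c("a", "b", "c"), each = 2))
folds <- split(seq_len(nrow(df_blocked)), df_blocked$species)

lengths(folds)
## a b c
## 2 2 2

# hold out one whole species at a time when cross-validating
```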
Leave-one-out cross-validation (AKA LOOCV) is the same as K-fold cross-validation, but with \(K = n\), so each datapoint is its own fold.
df <- data.frame(y = y, x = x)

cost <- function(y, yhat) {
  -2 * dpois(y, yhat, log = TRUE)
}

boot::cv.glm(df, m_full, cost = cost)$delta[1]
## [1] 3027463

boot::cv.glm(df, m_int, cost = cost)$delta[1]
## [1] 6.83833
It looks like the full model did much worse than the intercept-only model. But that’s not surprising as each one was fitted using only two datapoints!